Abstract: In this paper, we present the theoretical foundation for optimal classification using class-specific features and provide examples of its use. A new probability density function (PDF) projection theorem makes it possible to project probability density functions from a low-dimensional feature space back to the raw data space. An M-ary classifier is constructed by estimating the PDFs of class-specific features, then transforming each PDF back to the raw data space where they can be fairly compared. Although statistical sufficiency is not a requirement, the classifier thus constructed will become equivalent to the optimal Bayes classifier if the features meet sufficiency requirements individually for each class. This classifier is completely modular and avoids the dimensionality curse associated with large complex problems. By recursive application of the projection theorem, it is possible to analyze complex signal processing chains. We apply the method to feature sets including linear functions of independent random variables, cepstrum, and MEL cepstrum. In addition, we demonstrate how it is possible to automate the feature and model selection process by direct comparison of log-likelihood values on the common raw data domain.

Index Terms: Bayesian classification, class-dependent features, classification, class-specific features, hidden Markov models, maximum likelihood estimation, pattern classification, PDF estimation, probability density function.


I.  Introduction 
A. Overview and Outline

In this paper, we introduce a theorem that can be applied to any statistical approach that makes use of likelihood comparisons, such as detection, classification, and statistical modeling. The theorem allows likelihood comparisons to be made in the common raw data domain while the difficult task of probability density function (PDF) estimation can be made in class (or state) dependent low-dimensional feature spaces. Because each feature set can be designed without regard to other classes (or states), it can be of much lower dimension than a common feature set that must account for all classes, effectively avoiding the curse of dimensionality. The transformation of feature PDFs to the raw data domain, which we term “PDF projection,” is accomplished by deriving a correction term that amounts to a generalized Jacobian of the feature transformation. This correction term depends only upon the feature transformation and a hand-picked class (or state) dependent statistical reference hypothesis. When combined with the feature likelihood value, it results in a raw data likelihood function which is guaranteed by the theorem to be a PDF on the raw data space. Examples of the method involving commonly used autoregressive and cepstrum features are provided.

A few words about the chronology of development are in order. This paper is based on previous work in class-specific features by the author and by Prof. S. Kay at the University of Rhode Island. The first two papers on the subject [1], [2] describe the original form of the class-specific method, which was based on a common fixed reference hypothesis and the properties of sufficient statistics. Although the present method is based on this previous work, we say little about it in this paper. This is because the present work is best understood from the viewpoint of PDF projection, and it would confuse the readers to begin with sufficient statistics. The interested reader is encouraged to examine this previous work, especially [2].

In Section I, we review classical Bayesian classification and discuss the dimensionality problem. In Section II, we introduce the PDF projection theorem (PPT) and the associated chain rule. In Section III, we discuss various methods of calculating the PPT correction term. In Section IV, we discuss how to apply the PPT to classification. In Section V, we apply the method to feature transformations involving linear functions of independent random variables (RVs). In Section VI, we apply the method to cepstrum and MEL cepstrum. In Section VII, we present a method of automatic feature selection.

B.  Classical  Classification  Theory  and  the  Dimensionality 
Problem 

The so-called M-ary classification problem is that of assigning a multidimensional sample of data $\mathbf{x} \in \mathcal{R}^N$ to one of $M$ classes. The statistical hypothesis that class $j$ is true is denoted by $H_j$, $1 \le j \le M$. The statistical characterization of $\mathbf{x}$ under each of the $M$ hypotheses is described completely by the PDFs, which are written $p(\mathbf{x}|H_j)$, $1 \le j \le M$. Classical theory as applied to the problem results in the so-called Bayes classifier, which simplifies to the Neyman-Pearson rule for equiprobable prior probabilities

$$j^* = \arg\max_j \, p(\mathbf{x}|H_j). \tag{1}$$

Because this classifier attains the minimum probability of error of all possible classifiers, it is the basis of most classifier designs. Unfortunately, it does not provide simple solutions to the dimensionality problem that arises when the PDFs are unknown and must be estimated. The most common solution is to reduce the dimension of the data by extraction of a small number of information-bearing features $\mathbf{z} = T(\mathbf{x})$, then recasting the classification problem in terms of $\mathbf{z}$:

$$j^* = \arg\max_j \, p(\mathbf{z}|H_j). \tag{2}$$

This leads to a fundamental tradeoff: whether to discard features in an attempt to reduce the dimension to something manageable or to include them and suffer the problems associated with estimating a PDF at high dimension. Unfortunately, there may be no acceptable compromise. Virtually all methods which attempt to find decision boundaries on a high-dimensional space are subject to this tradeoff or “curse” of dimensionality. For this reason, many researchers have explored the possibility of using class-specific features [3]-[9].

The basic idea in using class-specific features is to extract $M$ class-specific feature sets $\mathbf{z}_j = T_j(\mathbf{x})$, $1 \le j \le M$, where the dimension of each feature set is small, and then to arrive at a decision rule based only upon functions of the lower dimensional features. Unfortunately, the classifier modeled on the Neyman-Pearson rule

$$j^* = \arg\max_j \, p(\mathbf{z}_j|H_j) \tag{3}$$

is invalid because comparisons of densities on different feature spaces are meaningless. One of the first approaches that comes to mind is to compute for each class a likelihood ratio against a common hypothesis composed of “all other classes.” While this seems beneficial on the surface, there is no theoretical dimensionality reduction since for each likelihood ratio to be a sufficient statistic, “all features” must be included when testing each class against a hypothesis that includes “all other classes.” A number of other approaches have emerged in recent years to arrive at meaningful decision rules. Each method either makes a strong assumption (such as that the classes fall into linear subspaces) that limits the applicability of the method or else uses an ad hoc method of combining the likelihoods of the various feature sets.

1) A method used in speech recognition [3] uses phone-specific features. While, at first, this method appears to use class-specific features, it is actually using the same features extracted from the raw data but applying different models to the time evolution of these features.

2)  A  method  of  image  recognition  [10]  uses  class-specific 
features  to  detect  various  image  “fragments.”  The  method 
uses  a  nonprobabilistic  means  of  combining  fragments  to 
form  an  image. 

3) A method has been proposed that tests all pairs of classes [4]. To be exhaustive, this method has a complexity of $O(M^2)$ different tests and may be prohibitive for large $M$. A hierarchical approach has been proposed based on a binary tree of tests [5]. Implementation of the binary tree requires initial classification into meta-classes, which is an approach that is suboptimal because it makes hard decisions based on limited information.

4) Methods based on linear subspaces [6], [7] are popular because they use the powerful tool of linear subspace analysis. These methods can perform well in certain applications but are severely limited to problems where the classes are separable by linear processing.


5)  Support  vectors  [8]  are  a  relatively  new  approach  that  is 
based  on  finding  a  linear  decision  function  between  every 
pair  of  classes. 

As evidenced by the various approaches, there is a strong motivation for using class-specific features. Unfortunately, classical theory as it stands requires operating in a common feature space and fails to provide any guidance for a suitable class-specific architecture. In this paper, we present an extension to the classical theory that provides for an optimal architecture using class-specific features.

II.  PDF  Projection  Theorem 

It is well known how to write the PDF of $\mathbf{x}$ from the PDF of $\mathbf{z}$ when the transformation is 1:1. This is the change of variables theorem from basic probability. Let $\mathbf{z} = T(\mathbf{x})$, where $T(\mathbf{x})$ is an invertible and differentiable multidimensional transformation. Then

$$p_x(\mathbf{x}) = |\mathbf{J}(\mathbf{x})| \, p_z(T(\mathbf{x})) \tag{4}$$

where $|\mathbf{J}(\mathbf{x})|$ is the determinant of the Jacobian matrix of the transformation

$$\mathbf{J}_{ij} = \frac{\partial z_i}{\partial x_j}.$$

What we seek is a generalization of (4), which is valid for many-to-1 transformations. Define

$$\mathcal{P}(T, p_z) = \left\{ p_x(\mathbf{x}) : \mathbf{z} = T(\mathbf{x}) \text{ and } \mathbf{z} \sim p_z(\mathbf{z}) \right\}$$

that is, $\mathcal{P}(T, p_z)$ is the set of PDFs $p_x(\mathbf{x})$ which, through $T(\mathbf{x})$, generate PDF $p_z(\mathbf{z})$ on $\mathbf{z}$. If $T(\cdot)$ is many-to-one, $\mathcal{P}(T, p_z)$ will contain more than one member. Therefore, it is impossible to uniquely determine $p_x(\mathbf{x})$ from $T(\cdot)$ and $p_z(\mathbf{z})$. We can, however, find a particular solution if we constrain $p_x(\mathbf{x})$. In order to apply the constraint, it is necessary to make use of a reference hypothesis $H_0$ for which we know the PDF of both $\mathbf{x}$ and $\mathbf{z}$. If we constrain $p_x(\mathbf{x})$ such that for every transform pair $(\mathbf{x}, \mathbf{z})$ we have

$$\frac{p_x(\mathbf{x})}{p_x(\mathbf{x}|H_0)} = \frac{p_z(\mathbf{z})}{p_z(\mathbf{z}|H_0)} \tag{5}$$

or that the likelihood ratio (with respect to $H_0$) is the same in both the raw data and feature domains, we arrive at a satisfactory answer. We cannot offer a justification for this constraint other than it is a means of arriving at an answer; however, we will soon show that this constraint produces desirable properties. The particular form of $p_x(\mathbf{x})$ is uniquely defined by the constraint itself, namely

$$p_x(\mathbf{x}) = \frac{p_x(\mathbf{x}|H_0)}{p_z(\mathbf{z}|H_0)} \, p_z(\mathbf{z}), \quad \text{where } \mathbf{z} = T(\mathbf{x}). \tag{6}$$

Theorem 1 states that not only is $p_x(\mathbf{x})$ a PDF but that it is a member of $\mathcal{P}(T, p_z)$.

Theorem 1 (PDF Projection Theorem): Let $H_0$ be some fixed reference hypothesis with known PDF $p_x(\mathbf{x}|H_0)$. Let $\mathcal{X}$ be the region of support of $p_x(\mathbf{x}|H_0)$. In other words, $\mathcal{X}$ is the set of all points $\mathbf{x}$ where $p_x(\mathbf{x}|H_0) > 0$. Let $\mathbf{z} = T(\mathbf{x})$ be a many-to-one transformation. Let $\mathcal{Z}$ be the image of $\mathcal{X}$ under the transformation $T(\mathbf{x})$. Let the PDF of $\mathbf{z}$ when $\mathbf{x}$ is drawn from $p_x(\mathbf{x}|H_0)$ exist and be denoted by $p_z(\mathbf{z}|H_0)$. It follows that $p_z(\mathbf{z}|H_0) > 0$ for all $\mathbf{z} \in \mathcal{Z}$. Now, let $p_z(\mathbf{z})$ be any PDF with the same region of support $\mathcal{Z}$. Then, the function (6) is a PDF on $\mathcal{X}$, and thus

$$\int_{\mathbf{x} \in \mathcal{X}} p_x(\mathbf{x}) \, d\mathbf{x} = 1.$$

Furthermore, $p_x(\mathbf{x})$ is a member of $\mathcal{P}(T, p_z)$.

Proof: These assertions are proved in [11].
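The normalization claim of Theorem 1 can be checked numerically in a small case. The following is only a sketch under assumptions that are not in the paper: a two-dimensional $\mathbf{x}$, the feature $z = x_1^2 + x_2^2$, the reference $H_0$ of iid unit-variance Gaussian samples (so $z$ is chi-square with two degrees of freedom), and an arbitrarily chosen Gamma feature PDF. The function name and grid settings are hypothetical.

```python
import numpy as np
from math import lgamma

def log_projected_pdf(x1, x2):
    """log of the projected PDF (6) for z = x1^2 + x2^2, with H0: x ~ N(0, I) in R^2."""
    z = x1 ** 2 + x2 ** 2
    log_px_h0 = -np.log(2.0 * np.pi) - 0.5 * z      # N(0, I) density in two dimensions
    log_pz_h0 = -np.log(2.0) - 0.5 * z              # chi-square PDF with 2 degrees of freedom
    # Feature PDF chosen only for illustration: Gamma(shape=3, scale=1)
    log_pz = 2.0 * np.log(np.maximum(z, 1e-300)) - z - lgamma(3.0)
    return log_px_h0 - log_pz_h0 + log_pz

# Riemann sum of the projected PDF over a grid that covers essentially all of its mass
grid = np.linspace(-10.0, 10.0, 801)
dx = grid[1] - grid[0]
X1, X2 = np.meshgrid(grid, grid)
integral = np.exp(log_projected_pdf(X1, X2)).sum() * dx * dx
print(integral)
```

The feature PDF here is not the one generated by $H_0$, and $z$ is not assumed sufficient for anything, yet the projected function should still integrate to one over the raw data space.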

A.  Usefulness  and  Optimality  Conditions  of  the  Theorem 

The theorem shows that, provided we know the PDF under some reference hypothesis $H_0$ at both the input and output of transformation $T(\mathbf{x})$, if we are given an arbitrary PDF $p_z(\mathbf{z})$ defined on $\mathbf{z}$, we can immediately find a PDF $p_x(\mathbf{x})$ defined on $\mathbf{x}$ that generates $p_z(\mathbf{z})$. Although it is interesting that $p_x(\mathbf{x})$ generates $p_z(\mathbf{z})$, there are an infinite number of PDFs that do so, and it is not yet clear that $p_x(\mathbf{x})$ is the best choice. However, suppose we would like to use $p_x(\mathbf{x})$ as an approximation to the PDF $p_x(\mathbf{x}|H_1)$. Let this approximation be

$$\hat{p}_x(\mathbf{x}|H_1) = \frac{p_x(\mathbf{x}|H_0)}{p_z(\mathbf{z}|H_0)} \, \hat{p}_z(\mathbf{z}|H_1), \quad \text{where } \mathbf{z} = T(\mathbf{x}). \tag{7}$$

From Theorem 1, we see that (7) is a PDF. Furthermore, if $T(\mathbf{x})$ is a sufficient statistic for $H_1$ versus $H_0$, then as $\hat{p}_z(\mathbf{z}|H_1) \to p_z(\mathbf{z}|H_1)$, we have

$$\hat{p}_x(\mathbf{x}|H_1) \to p_x(\mathbf{x}|H_1).$$

This is immediately seen from the well-known property of the likelihood ratio, which states that if $T(\mathbf{x})$ is sufficient for $H_1$ versus $H_0$

$$\frac{p_x(\mathbf{x}|H_1)}{p_x(\mathbf{x}|H_0)} = \frac{p_z(\mathbf{z}|H_1)}{p_z(\mathbf{z}|H_0)}.$$

Note that for a given $H_1$, the choices of $T(\mathbf{x})$ and $H_0$ are coupled, so they must be chosen jointly. In addition, note that the sufficiency condition is required for optimality but is not necessary for (7) to be a valid PDF. Here, we can see the importance of the theorem. The theorem, in effect, provides a means of creating PDF approximations on the high-dimensional input data space without dimensionality penalty using low-dimensional feature PDFs and provides a way to optimize the approximation by controlling both the reference hypothesis $H_0$ as well as the features themselves. This is the remarkable property of Theorem 1: that the resulting function remains a PDF whether or not the features are sufficient statistics. Since sufficiency means optimality of the classifier, approximate sufficiency means PDF approximation and approximate optimality.

Theorem 1 allows maximum likelihood (ML) methods to be used in the raw data space to optimize the accuracy of the approximation over $T$ and $H_0$ as well as $\theta$. Let $\hat{p}_z(\mathbf{z}|H_1)$ be parameterized by the parameter $\theta$. Then, the maximization

$$\max_{\theta, T, H_0} \left\{ \frac{p_x(\mathbf{x}|H_0)}{p_z(\mathbf{z}|H_0)} \, \hat{p}_z(\mathbf{z}|H_1; \theta), \quad \mathbf{z} = T(\mathbf{x}) \right\} \tag{7a}$$

is a valid ML approach and can be used for model selection (with appropriate data cross-validation).
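A minimal sketch of how a maximization of this kind might drive feature selection is given below. Everything specific in it is an assumption made for illustration: the candidate features are partial sums of the first $K$ samples, $H_0$ is iid unit-variance zero-mean Gaussian noise, the feature PDF is a Gaussian with unknown mean $\theta$ and known variance $K$, and the helper names, sample sizes, and the train/test split are hypothetical.

```python
import numpy as np

def gauss_logpdf(v, mean, var):
    """Elementwise log of a univariate Gaussian PDF."""
    return -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (v - mean) ** 2 / var

def projected_loglik(X, K, theta):
    """Mean raw-domain log-likelihood of the projected PDF for z = sum of the first K samples."""
    z = X[:, :K].sum(axis=1)
    log_px_h0 = gauss_logpdf(X, 0.0, 1.0).sum(axis=1)   # H0: iid N(0, 1)
    log_pz_h0 = gauss_logpdf(z, 0.0, float(K))          # z ~ N(0, K) under H0
    log_pz_h1 = gauss_logpdf(z, theta, float(K))        # fitted feature PDF
    return np.mean(log_px_h0 - log_pz_h0 + log_pz_h1)

rng = np.random.default_rng(0)
N, K_true, n = 8, 3, 4000
mean = np.zeros(N)
mean[:K_true] = 1.0                                     # only the first K_true samples carry signal
train = rng.normal(mean, 1.0, size=(n, N))
test = rng.normal(mean, 1.0, size=(n, N))

scores = {}
for K in range(1, N + 1):
    theta_hat = train[:, :K].sum(axis=1).mean()         # ML estimate of the feature mean
    scores[K] = projected_loglik(test, K, theta_hat)    # scored on held-out data
best_K = max(scores, key=scores.get)
print(best_K, scores[best_K])
```

Because every candidate is scored as a PDF on the same raw data space, the scores are directly comparable even though each candidate feature lives in its own space.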

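Example 1: In this simple example, we demonstrate the applicability of Theorem 1. We consider the case of independent Gaussian RVs and two hypotheses concerning the mean. Let $\mathbf{x} = [x_1 \ldots x_N]'$. Let the feature transformation be

$$z = T(\mathbf{x}) = \sum_{i=1}^{K} x_i$$

where $1 \le K \le N$. Let $\mathcal{E}(x_i) = 0$ under $H_0$ and $\mathcal{E}(x_i) = 1$ under $H_1$, where $\mathcal{E}(\cdot)$ is the expectation operator. Because $H_0$ and $H_1$ are hypotheses concerning the mean of Gaussian RVs with fixed variance, $z$ is a sufficient statistic for the mean when $K = N$. The Gaussian PDF of $\mathbf{x}$ under $H_0$ may be written

$$p_x(\mathbf{x}|H_0) = (2\pi\sigma^2)^{-N/2} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{N} x_i^2 \right\}.$$

Under $H_0$, $z$ will be Gaussian zero-mean with variance $K\sigma^2$, and thus

$$p_z(z|H_0) = (2\pi K\sigma^2)^{-1/2} \exp\left\{ -\frac{z^2}{2K\sigma^2} \right\}.$$

We let $p_z(z)$ be the Gaussian PDF

$$p_z(z|H_1) = (2\pi K\sigma^2)^{-1/2} \exp\left\{ -\frac{(z-K)^2}{2K\sigma^2} \right\}.$$

By the projection theorem

$$p_x(\mathbf{x}) = \frac{p_x(\mathbf{x}|H_0)}{p_z(z|H_0)} \, p_z(z|H_1)$$

$$p_x(\mathbf{x}) = (2\pi\sigma^2)^{-N/2} (2\pi K\sigma^2)^{1/2} (2\pi K\sigma^2)^{-1/2} \exp\left\{ -\frac{1}{2\sigma^2} \left[ \sum_{i=1}^{N} x_i^2 - \frac{z^2}{K} + \frac{(z-K)^2}{K} \right] \right\}$$

$$p_x(\mathbf{x}) = (2\pi\sigma^2)^{-N/2} \exp\left\{ -\frac{1}{2\sigma^2} \left[ \sum_{i=1}^{N} x_i^2 - \frac{z^2}{K} + \frac{z^2}{K} - 2z + K \right] \right\}$$

$$p_x(\mathbf{x}) = (2\pi\sigma^2)^{-N/2} \exp\left\{ -\frac{1}{2\sigma^2} \left[ \sum_{i=1}^{N} x_i^2 - 2z + K \right] \right\}$$

$$\phantom{p_x(\mathbf{x})} = (2\pi\sigma^2)^{-N/2} \exp\left\{ -\frac{1}{2\sigma^2} \left[ \sum_{i=1}^{K} (x_i - 1)^2 + \sum_{i=K+1}^{N} x_i^2 \right] \right\}$$

where we have made the substitution $z = \sum_{i=1}^{K} x_i$. It is clear that the result is a Gaussian PDF with mean $\mu_i = 1$ for $1 \le i \le K$ and $\mu_i = 0$ for $K+1 \le i \le N$. Note also that it is a PDF, regardless of $K$ (that is to say, regardless of the sufficiency of $z$). It is also clear that the PDF $p_x(\mathbf{x})$ generates the PDF $p_z(z)$. In addition, note that if $K = N$, then $p_x(\mathbf{x}) = p_x(\mathbf{x}|H_1)$, as predicted by the theory.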
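The algebra of Example 1 can be confirmed numerically. The sketch below evaluates the projected log-PDF directly from its three factors and compares it with the closed-form Gaussian whose first $K$ means equal one; the helper names, the random test point, and the value of $\sigma^2$ are arbitrary choices made for illustration.

```python
import numpy as np

def gauss_logpdf(v, mean, var):
    """Elementwise log of a univariate Gaussian PDF."""
    return -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (v - mean) ** 2 / var

def log_projected(x, K, sigma2):
    """log p_x(x) from the projection theorem with z = x_1 + ... + x_K."""
    z = x[:K].sum()
    log_px_h0 = gauss_logpdf(x, 0.0, sigma2).sum()      # raw data under H0
    log_pz_h0 = gauss_logpdf(z, 0.0, K * sigma2)        # feature under H0
    log_pz_h1 = gauss_logpdf(z, float(K), K * sigma2)   # feature under H1
    return log_px_h0 - log_pz_h0 + log_pz_h1

def log_closed_form(x, K, sigma2):
    """Gaussian with mean 1 on the first K samples and mean 0 on the rest."""
    mu = np.zeros_like(x)
    mu[:K] = 1.0
    return gauss_logpdf(x, mu, sigma2).sum()

rng = np.random.default_rng(1)
N, sigma2 = 6, 2.5
x = rng.normal(size=N)
gaps = [abs(log_projected(x, K, sigma2) - log_closed_form(x, K, sigma2))
        for K in range(1, N + 1)]
print(max(gaps))
```

The two expressions should agree to rounding error for every $K$, including the values $K < N$ for which $z$ is not a sufficient statistic.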
B. Data-Dependent Reference Hypothesis

We now mention a useful property of (7). Let $\mathcal{H}_z$ be a region of sufficiency (ROS) of $\mathbf{z}$, which is defined as a set of all hypotheses such that for every pair of hypotheses $H_{0a}, H_{0b} \in \mathcal{H}_z$, we have

$$\frac{p_x(\mathbf{x}|H_{0a})}{p_x(\mathbf{x}|H_{0b})} = \frac{p_z(\mathbf{z}|H_{0a})}{p_z(\mathbf{z}|H_{0b})}. \tag{8}$$

An ROS may be thought of as a family of PDFs traced out by the parameters of a PDF, where $\mathbf{z}$ is a sufficient statistic for the parameters. The ROS may or may not be unique. For example, the ROS for a sample mean statistic could be a family of Gaussian PDFs with variance 1 traced out by the mean parameter. Another ROS would be produced by a different variance. The “J-function”

$$J(\mathbf{x}, T, H_0) \triangleq \frac{p_x(\mathbf{x}|H_0)}{p_z(T(\mathbf{x})|H_0)} = \frac{p_x(\mathbf{x}|H_0)}{p_z(\mathbf{z}|H_0)}$$

is independent of $H_0$ as long as $H_0$ remains within ROS $\mathcal{H}_z$.

Defining the ROS should in no way be interpreted as a sufficiency requirement for $\mathbf{z}$. All statistics $\mathbf{z}$ have an ROS that may or may not include $H_1$ (it does only in the ideal case). Defining $\mathcal{H}_z$ is used only in determining the allowable range of reference hypotheses when using a data-dependent reference hypothesis.

For example, let $z$ be the sample variance of $\mathbf{x}$. Let $H_0(\sigma^2)$ be the hypothesis that $\mathbf{x}$ is a set of $N$ independent identically distributed zero-mean Gaussian samples with variance $\sigma^2$. Clearly, an ROS for $z$ is the set of all PDFs traced out by $\sigma^2$. We have

$$p(\mathbf{x}|H_0(\sigma^2)) = (2\pi\sigma^2)^{-N/2} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} x_n^2 \right\}$$

and, since $z$ is a $\chi^2(N)$ random variable (scaled by $\sigma^2/N$)

$$p(z|H_0(\sigma^2)) = \frac{N}{\sigma^2} \, \frac{2^{-N/2}}{\Gamma\left(\frac{N}{2}\right)} \left( \frac{Nz}{\sigma^2} \right)^{\frac{N}{2}-1} \exp\left\{ -\frac{Nz}{2\sigma^2} \right\}.$$

It is easily verified that the contribution of $\sigma^2$ is canceled in the J-function ratio.

Because $J(\mathbf{x}, T, H_0(\sigma^2))$ is independent of $\sigma^2$, it is possible to make $\sigma^2$ a function of the data itself, changing it with each input sample. In the example above, since $z$ is the sample variance, we could let the assumed variance under $H_0$ depend on $z$ according to $\sigma^2 = z$.

However, if $J(\mathbf{x}, T, H_0(\sigma^2))$ is independent of $\sigma^2$, one may question what purpose it serves to vary $\sigma^2$. The reason is purely numerical. Note that in general, we do not have an analytic form for the J-function but instead have separate numerator and denominator terms. Often, computing $J(\mathbf{x}, T, H_0(\sigma^2))$ can pose some tricky numerical problems, particularly if $\mathbf{x}$ and $\mathbf{z}$ are in the tails of the respective PDFs. Therefore, our approach is to position $H_0$ to maximize the numerator PDF (which simultaneously maximizes the denominator). Another reason to do this is to allow PDF approximations to be used in the denominator that are not valid in the tails, such as the central limit theorem (CLT).

In our example, the maximum of the numerator clearly happens at $\sigma^2 = z$ because $z$ is the maximum likelihood estimator of $\sigma^2$. We will explore the relationship of this method to asymptotic ML theory in a later section. To reflect the possible dependence of $H_0$ on $\mathbf{z}$, we adopt the notation $H_0(\mathbf{z})$. Thus

$$\hat{p}_x(\mathbf{x}|H_1) = \frac{p_x(\mathbf{x}|H_0(\mathbf{z}))}{p_z(\mathbf{z}|H_0(\mathbf{z}))} \, \hat{p}_z(\mathbf{z}|H_1), \quad \text{where } \mathbf{z} = T(\mathbf{x}). \tag{9}$$


The existence of $\mathbf{z}$ on the right side of the conditioning operator | is admittedly a very bad use of notation but is done for simplicity. The meaning of $\mathbf{z}$ can be understood using the following imaginary situation. Imagine that we are handed a data sample $\mathbf{x}$, and we evaluate (7) for a particular hypothesis $H_0 \in \mathcal{H}_z$. Out of curiosity, we try it again for a different hypothesis in $\mathcal{H}_z$. We find that no matter which $H_0 \in \mathcal{H}_z$ we use, the result is the same. We notice, however, that for an $H_0$ that produces larger values of $p_x(\mathbf{x}|H_0(\mathbf{z}))$ and $p_z(\mathbf{z}|H_0(\mathbf{z}))$, the requirement for numerical accuracy is less stringent. It may require fewer terms in a polynomial expansion or else fewer bits of numerical accuracy. Now, we are handed a new sample of $\mathbf{x}$, but this time, having learned our lesson, we immediately choose the $H_0 \in \mathcal{H}_z$ that maximizes $p_x(\mathbf{x}|H_0(\mathbf{z}))$. If we do this every time, we realize that $H_0$ is now a function of $\mathbf{z}$. The dependence, however, carries no statistical meaning and only has a numerical interpretation.
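The thought experiment above can be tried numerically for the feature $z = \frac{1}{N}\sum_n x_n^2$ with zero-mean Gaussian reference hypotheses of variance $\sigma^2$. The sketch below is an illustration under assumed settings (the function names, $N = 1000$, and the data scale are not from the paper): it evaluates the log J-function with an exact chi-square denominator for several values of $\sigma^2$, and then with the data-dependent choice $\sigma^2 = z$ and a Gaussian (CLT) approximation to the denominator.

```python
import numpy as np
from math import lgamma

def log_j_exact(x, sigma2):
    """log J for z = mean(x**2) with reference H0(sigma2): iid N(0, sigma2)."""
    N = x.size
    z = np.mean(x ** 2)
    log_px = -0.5 * N * np.log(2.0 * np.pi * sigma2) - 0.5 * N * z / sigma2
    # N*z/sigma2 is chi-square with N degrees of freedom under H0(sigma2)
    log_pz = (np.log(N / sigma2) - 0.5 * N * np.log(2.0) - lgamma(0.5 * N)
              + (0.5 * N - 1.0) * np.log(N * z / sigma2) - 0.5 * N * z / sigma2)
    return log_px - log_pz

def log_j_clt(x):
    """log J with the data-dependent reference sigma2 = z and a CLT denominator."""
    N = x.size
    z = np.mean(x ** 2)
    log_px = -0.5 * N * np.log(2.0 * np.pi * z) - 0.5 * N
    # Gaussian approximation with mean z and variance 2*z**2/N, evaluated at its peak
    log_pz = -0.5 * np.log(2.0 * np.pi * 2.0 * z ** 2 / N)
    return log_px - log_pz

rng = np.random.default_rng(2)
x = 7.0 * rng.normal(size=1000)
exact = [log_j_exact(x, s2) for s2 in (1.0, 10.0, np.mean(x ** 2))]
spread = max(exact) - min(exact)        # invariance of J over the ROS
clt_gap = log_j_clt(x) - exact[0]       # cost of the CLT approximation
print(spread, clt_gap)
```

The spread across reference variances should be at rounding level, while the CLT version should differ by roughly $-1/(6N)$ in the log domain, which is what Stirling's approximation of $\log\Gamma(N/2)$ predicts.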

In many problems, $\mathcal{H}_z$ is not easily found, and we must be satisfied with approximate sufficiency. In this case, there is a weak dependence of $J(\mathbf{x}, T, H_0)$ upon $H_0$. This dependence is generally unpredictable unless, as we have suggested, $H_0(\mathbf{z})$ is always chosen to maximize the numerator PDF. Then, the behavior of $J(\mathbf{x}, T, H_0)$ is somewhat predictable. Because the numerator is always maximized, the result is a positive bias. This positive bias is most notable when there is a good match to the data, which is a desirable feature.

C.  Asymptotic  ML  Theory  as  a  Special  Case  of  the  PDF 
Projection  Theorem 

We have stated that when we use a data-dependent reference hypothesis, we prefer to choose the reference hypothesis such that the numerator of the J-function is a maximum. Since we often have parametric forms for the PDFs, this amounts to finding the ML estimates of the parameters. If there are a small number of features, all of the features are ML estimators for parameters of the PDF, and there is sufficient data to guarantee that the ML estimators fall in the asymptotic (large data) region, then the data-dependent hypothesis approach is equivalent to an existing approach based on classical asymptotic ML theory. We will derive the well-known asymptotic result using (9).

Two well-known results from asymptotic theory [12] are the following.

1) Subject to certain regularity conditions (large amount of data, a PDF that depends on a finite number of parameters and is differentiable, etc.), the PDF $p_x(\mathbf{x}; \theta^*)$ may be approximated by

$$p_x(\mathbf{x}; \theta^*) \simeq p_x(\mathbf{x}; \hat{\theta}) \exp\left\{ -\frac{1}{2} (\theta^* - \hat{\theta})' \, \mathbf{I}(\hat{\theta}) \, (\theta^* - \hat{\theta}) \right\} \tag{10}$$

where $\theta^*$ is an arbitrary value of the parameter, $\hat{\theta}$ is the maximum likelihood estimate (MLE) of $\theta$, and $\mathbf{I}(\theta)$ is the Fisher's information matrix (FIM) [12]. The components of the FIM for PDF parameters $\theta_k$, $\theta_l$ are given by

$$\mathbf{I}_{\theta_k, \theta_l}(\theta) = -\mathcal{E}\left( \frac{\partial^2 \ln p_x(\mathbf{x}; \theta)}{\partial \theta_k \, \partial \theta_l} \right).$$

The approximation is valid only for $\theta^*$ in the vicinity of the MLE (and the true value).

2) The MLE $\hat{\theta}$ is approximately Gaussian with mean equal to the true value $\theta$ and covariance equal to $\mathbf{I}^{-1}(\theta)$ or

$$p_\theta(\hat{\theta}; \theta) \simeq (2\pi)^{-P/2} |\mathbf{I}(\hat{\theta})|^{1/2} \exp\left\{ -\frac{1}{2} (\theta - \hat{\theta})' \, \mathbf{I}(\hat{\theta}) \, (\theta - \hat{\theta}) \right\} \tag{11}$$

where $P$ is the dimension of $\theta$. Note that we use $\hat{\theta}$ in evaluating the FIM in place of $\theta$, which is unknown. This is allowed because $\mathbf{I}^{-1}(\theta)$ has a weak dependence on $\theta$. The approximation is valid only for $\theta$ in the vicinity of the MLE.
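As a small worked illustration of the FIM definition (our own addition, assuming $N$ iid zero-mean Gaussian samples with a single unknown parameter $\theta = \sigma^2$, so that $P = 1$), the information can be computed in closed form:

```latex
\begin{align*}
\ln p_x(\mathbf{x};\sigma^2) &= -\frac{N}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{n=1}^{N} x_n^2, \\
\frac{\partial^2 \ln p_x(\mathbf{x};\sigma^2)}{\partial(\sigma^2)^2} &= \frac{N}{2\sigma^4} - \frac{1}{\sigma^6}\sum_{n=1}^{N} x_n^2, \\
\mathbf{I}(\sigma^2) &= -\mathcal{E}\left(\frac{\partial^2 \ln p_x(\mathbf{x};\sigma^2)}{\partial(\sigma^2)^2}\right)
  = -\frac{N}{2\sigma^4} + \frac{N\sigma^2}{\sigma^6} = \frac{N}{2\sigma^4}.
\end{align*}
```

Under this assumption the MLE is $\hat{\theta} = \frac{1}{N}\sum_n x_n^2$, and the Gaussian approximation in result 2) has variance $\mathbf{I}^{-1} = 2\sigma^4/N$.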

To apply (9), $\hat{\theta}$ takes the place of $\mathbf{z}$, and $H_0(\mathbf{z})$ is the hypothesis that $\hat{\theta}$ is the true value of $\theta$. We substitute (10) for $p_x(\mathbf{x}|H_0(\mathbf{z}))$ and (11) for $p_z(\mathbf{z}|H_0(\mathbf{z}))$. Under the stated conditions, the exponential terms in approximations (10) and (11) become 1. Using these approximations, we arrive at

$$\hat{p}_x(\mathbf{x}|H_1) = \frac{p_x(\mathbf{x}; \hat{\theta})}{(2\pi)^{-P/2} |\mathbf{I}(\hat{\theta})|^{1/2}} \, \hat{p}_\theta(\hat{\theta}|H_1) \tag{12}$$

which agrees with the PDF approximation from asymptotic theory [13], [14].

To compare (9) and (12), we note that for both, there is an implied sufficiency requirement for $\mathbf{z}$ and $\hat{\theta}$, respectively. Specifically, $H_0(\mathbf{z})$ must remain in the ROS of $\mathbf{z}$, whereas $\hat{\theta}$ must be asymptotically sufficient for $\theta$. However, (9) is more general since (12) is valid only when all of the features are ML estimators and only holds asymptotically for large data records with the implication that $\hat{\theta}$ tends to Gaussian, whereas (9) has no such implication. This is particularly important in upstream processing, where there has not been significant data reduction, and asymptotic results do not apply. Using (9), we can make simple adjustments to the reference hypothesis to match the data better and avoid the PDF tails (such as controlling variance), where we are certain that we remain in the ROS of $\mathbf{z}$. As an aside, we note that (7) with a fixed reference hypothesis is even more general since there is no implied sufficiency requirement for $\mathbf{z}$.

D. Chain Rule

In many cases, it is difficult to derive the J-function for an entire processing chain. On the other hand, it may be quite easy to do it for one stage of processing at a time. In this case, the chain rule can be used to good advantage. The chain rule is just the recursive application of the PDF projection theorem. For example, consider a processing chain

$$\mathbf{x} \xrightarrow{\; T_1(\mathbf{x}) \;} \mathbf{y} \xrightarrow{\; T_2(\mathbf{y}) \;} \mathbf{w} \xrightarrow{\; T_3(\mathbf{w}) \;} \mathbf{z}. \tag{13}$$

The recursive use of (7) gives

$$\hat{p}_x(\mathbf{x}|H_1) = \frac{p_x(\mathbf{x}|H_0(\mathbf{y}))}{p_y(\mathbf{y}|H_0(\mathbf{y}))} \, \frac{p_y(\mathbf{y}|H_0'(\mathbf{w}))}{p_w(\mathbf{w}|H_0'(\mathbf{w}))} \, \frac{p_w(\mathbf{w}|H_0''(\mathbf{z}))}{p_z(\mathbf{z}|H_0''(\mathbf{z}))} \, \hat{p}_z(\mathbf{z}|H_1) \tag{14}$$

where $\mathbf{y} = T_1(\mathbf{x})$, $\mathbf{w} = T_2(\mathbf{y})$, $\mathbf{z} = T_3(\mathbf{w})$, and $H_0(\mathbf{y})$, $H_0'(\mathbf{w})$, $H_0''(\mathbf{z})$ are reference hypotheses (possibly data-dependent) suited to each stage in the processing chain. By defining the J-functions of each stage, we may write the above as

$$\hat{p}_x(\mathbf{x}|H_1) = J(\mathbf{x}, T_1, H_0(\mathbf{y})) \, J(\mathbf{y}, T_2, H_0'(\mathbf{w})) \, J(\mathbf{w}, T_3, H_0''(\mathbf{z})) \, \hat{p}_z(\mathbf{z}|H_1). \tag{15}$$

There is a special embedded relationship between the hypotheses. Let $\mathcal{H}_y$, $\mathcal{H}_w$, and $\mathcal{H}_z$ be the ROSs of $\mathbf{y}$, $\mathbf{w}$, and $\mathbf{z}$, respectively. Then, we have $\mathcal{H}_z \subset \mathcal{H}_w \subset \mathcal{H}_y$. If we use variable reference hypotheses, we also must have $H_0''(\mathbf{z}) \in \mathcal{H}_z$, $H_0'(\mathbf{w}) \in \mathcal{H}_w$, and $H_0(\mathbf{y}) \in \mathcal{H}_y$. This embedding of the hypotheses is illustrated in Fig. 1. The condition $H_1 \in \mathcal{H}_z$ is the ideal situation and is not necessary to produce a valid PDF. The factorization (14), together with the embedding of the hypotheses, we call the chain-rule processor (CRP).

Fig. 1. Required embedding of hypotheses for chain-rule processor corresponding to (13) and (14). The condition $H_1 \in \mathcal{H}_z$ is not necessary for a valid PDF but is desirable for processor optimality.
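A two-stage toy chain may help make the chain rule concrete. The sketch below is an assumed example, not taken from the paper: stage 1 forms block sums of Gaussian samples, stage 2 sums the blocks, and each stage has its own Gaussian reference hypothesis. All function names and sizes are hypothetical. It checks that the product of stage J-functions equals the J-function of the whole chain, and that the stage-2 J-function does not change when its reference mean is made data-dependent (the mean traces out an ROS for a sum statistic).

```python
import numpy as np

def gauss_logpdf(v, mean, var):
    """Elementwise log of a univariate Gaussian PDF."""
    return -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (v - mean) ** 2 / var

def log_j_stage1(x, B):
    """Stage 1: y = sums over blocks of B samples; fixed reference H0: x iid N(0, 1)."""
    y = x.reshape(-1, B).sum(axis=1)
    log_j = gauss_logpdf(x, 0.0, 1.0).sum() - gauss_logpdf(y, 0.0, float(B)).sum()
    return log_j, y

def log_j_stage2(y, B, m):
    """Stage 2: z = sum(y); reference H0'(m): y iid N(m, B), so z ~ N(M*m, M*B)."""
    z = y.sum()
    M = y.size
    log_j = gauss_logpdf(y, m, float(B)).sum() - gauss_logpdf(z, M * m, float(M * B))
    return log_j, z

rng = np.random.default_rng(3)
B, M = 4, 5
x = rng.normal(loc=0.7, size=B * M)

lj1, y = log_j_stage1(x, B)
lj2_fixed, z = log_j_stage2(y, B, 0.0)
lj2_variable, _ = log_j_stage2(y, B, z / M)     # data-dependent reference mean

# J-function of the whole chain x -> z computed in a single step under H0
lj_direct = gauss_logpdf(x, 0.0, 1.0).sum() - gauss_logpdf(z, 0.0, float(B * M))
print(lj1 + lj2_fixed, lj_direct, lj2_variable)
```

In this toy case every stage PDF is known exactly, so the agreement is exact up to rounding; in practice the appeal of the chain rule is that each stage can use whatever reference hypothesis and approximation suits it best.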

III.  Types  of  J-Functions 

We  now  summarize  the  various  methods  we  have  discussed 
for  computing  the  J-function. 

A.  Fixed  Reference  Hypothesis 

For modules using a fixed reference hypothesis, care must be taken in calculation of the J-function because the data is more often than not in the tails of the PDF. For fixed reference hypotheses, the J-function is

$$J(\mathbf{x}, T, H_0) = \frac{p_x(\mathbf{x}|H_0)}{p_z(\mathbf{z}|H_0)}. \tag{16}$$

The numerator density is usually of a simple form, so it is known exactly. The denominator density $p_z(\mathbf{z}|H_0)$ must be known exactly or approximated carefully so that it is accurate even in the far tails of the PDF. The saddlepoint approximation (SPA), which was described in a recent publication [15], provides a solution for cases when the exact PDF cannot be derived but the exact moment-generating function (MGF) is known. The SPA is known to be accurate in the far tails of the PDF [15].

Example  2:  As  a  very  simple  example  of  a  fixed-reference 
module,  let  x  be  a  time-series,  and  let  z  be  the  power  estimate 


JV 


=  kY.*1 


n= 1 


For  Hq  being  WGN,  px(x.\Ho )  is  quite  simple  to  write,  namely 


N 

logp*(x|i?o)  =  log(27r) 


N 

£4 

n= 1 


(17) 


Clearly, z is a Chi-square RV with N degrees of freedom scaled by 1/N. Thus

$$\log p(z|H_0) = \log N - \log \Gamma\left(\frac{N}{2}\right) - \frac{N}{2}\log(2) + \left(\frac{N}{2} - 1\right)\log(Nz) - \frac{Nz}{2}. \tag{18}$$

B. Variable Reference Hypothesis Modules

For a variable reference hypothesis, the J function is

$$J(x, T, H_0(z)) = \frac{p_x(x|H_0(z))}{p_z(z|H_0(z))}. \tag{19}$$

Modules using a variable reference are usually designed to position the reference hypothesis at the peak of the denominator PDF, which is approximated by the CLT.

Example 3: We can take Example 2 and redesign the module as a variable reference module. Now, instead of using reference $H_0$, we use the reference hypothesis $H_0(z)$ that x has variance $\sigma^2 = z$. Thus

$$\log p_x(x|H_0(z)) = -\frac{N}{2}\log(2\pi z) - \frac{\sum_{i=1}^{N} x_i^2}{2z}. \tag{20}$$

Now, z will still be Chi-square, but we can approximate its PDF by the CLT. Accordingly, z has mean $\sigma^2 = z$ and variance $2\sigma^4/N = 2z^2/N$. Thus

$$\log p(z|H_0(z)) \simeq -\frac{1}{2}\log\left(\frac{4\pi z^2}{N}\right) - \frac{(z - z)^2}{4z^2/N}. \tag{21}$$

With the scale factor drawn from a uniform distribution in the [0,100] range, the comparison of the fixed and variable reference methods produced the following results.

          log J function
    Fixed ref          Variable ref       Error
    -5.6586675e+03     -5.6586677e+03     -0.000166666
    -3.5594421e+03     -3.5594422e+03     -0.000166667
    -5.2287542e+03     -5.2287544e+03     -0.000166667
    -5.1864650e+03     -5.1864652e+03     -0.000166667
    -4.9694992e+03     -4.9694993e+03     -0.000166666
    -4.1845311e+03     -4.1845313e+03     -0.000166667
    -5.6939485e+03     -5.6939487e+03     -0.000166667
    -5.6560365e+03     -5.6560367e+03     -0.000166667
    -5.6915408e+03     -5.6915410e+03     -0.000166667
    -5.2675655e+03     -5.2675656e+03     -0.000166667

There is almost no difference between the approaches (a 0.00016 error in log domain). The error rises as N decreases because the CLT approximation worsens.

C. Maximum Likelihood Modules

A special case of the variable reference hypothesis approach is the ML method, when z is an MLE (see Section II-C)

$$J(x, T, H_0) = \frac{p(x|\hat{\theta})}{(2\pi)^{-P/2}\,|I(\hat{\theta})|^{1/2}}.$$

To continue Examples 2 and 3, it is known that the ML estimator for variance is the sample variance, which has a Cramer-Rao (CR) bound of $\sigma^2_{\min} = 2\sigma^4/N$. Applying (12), we get exactly the same result as the above variable reference approach. Whenever the feature is also an ML estimate and the asymptotic results apply (the number of estimated parameters is small and the amount of data is large), the two methods are identical. The variable reference hypothesis method is more general because it does not need to rely on the CLT.
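As a quick check of the equivalence claimed for Examples 2 and 3, the ML denominator can be worked out for the variance feature. This is only a sketch: it assumes the standard Fisher information for the variance of N iid zero-mean Gaussian samples, $I(\sigma^2) = N/(2\sigma^4)$ (the reciprocal of the CR bound quoted above), with P = 1 and $\hat{\theta} = z$.

```latex
\begin{align*}
(2\pi)^{-1/2}\,|I(z)|^{1/2}
  &= (2\pi)^{-1/2}\left(\frac{N}{2z^2}\right)^{1/2}
   = \left(\frac{4\pi z^2}{N}\right)^{-1/2}, \\
\log\left[(2\pi)^{-1/2}\,|I(z)|^{1/2}\right]
  &= -\frac{1}{2}\log\left(\frac{4\pi z^2}{N}\right).
\end{align*}
```

Under these assumptions the ML denominator coincides with the CLT denominator (21) evaluated at its peak, where the quadratic term vanishes.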

D.  One-to-One  Transformations 

One-to-one transformations do not change the information content of the data, but they are important for feature conditioning prior to PDF estimation. Recall from Section II that Theorem 1 is a generalization of the change-of-variables theorem for 1:1 transformations. Thus, for 1:1 transformations, the J-function reduces to the absolute value of the determinant of the Jacobian matrix (4)

$$J(x, T) = |J_T(x)|.$$

Our first example is the log transformation, which is useful when applied to exponential RVs to obtain a more “Gaussian-like” distribution.

Example 4 - Log Transformation: Let $z = \log(x)$. We have $dz/dx = 1/x$; thus, $\log J = \log(1/x) = -\log x = -z$. For vector arguments

$$\log J = -\sum_{i=1}^{N} z_i.$$
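The change-of-variables relation behind Example 4 can be checked numerically. The sketch below assumes a unit-mean exponential input, so that $p_x(x) = e^{-x}$ and, for $z = \log x$, $p_z(z) = \exp(z - e^z)$. The function names are illustrative and do not come from the paper.

```python
import math

def log_px(x):
    """Log-PDF of a unit-mean exponential RV (the assumed input distribution)."""
    return -x

def log_pz(z):
    """Log-PDF of z = log(x) when x is unit-mean exponential."""
    return z - math.exp(z)

def log_j(x_vec):
    """Log J-function of the element-wise log transformation: -sum(z_i)."""
    return -sum(math.log(x) for x in x_vec)

def projected_log_pdf(x_vec):
    """log J + log p_z(z); for a 1:1 map this should reproduce log p_x(x)."""
    return log_j(x_vec) + sum(log_pz(math.log(x)) for x in x_vec)
```

For any positive input vector, `projected_log_pdf` should agree with the sum of `log_px` over the elements, which is the 1:1 special case of the projection theorem.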


Notice the complete cancellation of the last term in (21).

Let us compare the fixed hypothesis method (17) and (18) with the variable hypothesis method (20) and (21) for the power feature. We create input data x from iid samples of Gaussian noise but with a random scaling. The scale factor was chosen from a uniform distribution in the [0,100] range.
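The comparison of the two modules for the power feature can be sketched as follows. This is a minimal illustration, not the authors' code: the vector length N = 1000 is an assumption (the text does not state it), and the function names are hypothetical. The fixed-reference module uses the exact Chi-square denominator (18); the variable-reference module uses (20) and the CLT denominator (21), whose quadratic term is zero.

```python
import math
import random

def power_feature(x):
    """Power feature z = (1/N) * sum(x_i^2)."""
    return sum(v * v for v in x) / len(x)

def log_j_fixed(x):
    """log J with the fixed reference H0 (unit-variance WGN) and denominator (18)."""
    n = len(x)
    z = power_feature(x)
    log_px = -0.5 * n * math.log(2.0 * math.pi) - 0.5 * n * z
    log_pz = (math.log(n) - math.lgamma(0.5 * n) - 0.5 * n * math.log(2.0)
              + (0.5 * n - 1.0) * math.log(n * z) - 0.5 * n * z)
    return log_px - log_pz

def log_j_variable(x):
    """log J with the variable reference H0(z) (variance z) and CLT denominator (21)."""
    n = len(x)
    z = power_feature(x)
    log_px = -0.5 * n * math.log(2.0 * math.pi * z) - 0.5 * n  # (20), using sum(x^2) = N z
    log_pz = -0.5 * math.log(4.0 * math.pi * z * z / n)         # (21), quadratic term is zero
    return log_px - log_pz

if __name__ == "__main__":
    random.seed(1)
    n = 1000  # assumed vector length; not stated in the text
    for _ in range(5):
        scale = random.uniform(0.0, 100.0)
        x = [scale * random.gauss(0.0, 1.0) for _ in range(n)]
        jf, jv = log_j_fixed(x), log_j_variable(x)
        print(f"{jf:.7e}  {jv:.7e}  {jv - jf:+.9f}")
```

Because the exact J-function does not depend on the reference hypothesis, the two outputs differ only through the CLT approximation of the denominator. By Stirling's series for the log-gamma function that difference is close to -1/(6N), which shrinks in magnitude as N grows.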


A very important one-to-one transformation in signal processing is the conversion from autocorrelation function (ACF) to reflection coefficients (RCs) using the Levinson algorithm [16]. RCs tend to be better features since they are less correlated than ACF estimates.

Fig. 2. Block diagram of a class-specific classifier.


Example 5 - Conversion From ACF to RCs: Let $z = T(r)$, where $r = [r_0, r_1 \ldots r_P]'$ and $z = [r_0, k_1 \ldots k_P]'$, where $\{k_1 \ldots k_P\}$ are the P RCs. The Jacobian is

$$|J_T| = r_0^{-P} \prod_{i=1}^{P-1} (1 - k_i^2)^{-(P-i)}.$$

Although the RCs are uncorrelated, they are subject to the limit $|k_i| < 1$, which gives their distribution a discontinuity. To obtain more Gaussian behavior, the log-bilinear transformation is recommended (thanks to S. Kay).
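The ACF-to-RC conversion of Example 5 can be sketched with the standard Levinson-Durbin recursion. Two assumptions are made here and should be checked against the intended convention: the predictor polynomial is taken as $1 + a_1 z^{-1} + \ldots$, so that $k_1 = -r_1/r_0$, and the Jacobian is taken as $r_0^{-P}\prod_{i=1}^{P-1}(1-k_i^2)^{-(P-i)}$. Function names are illustrative.

```python
import math

def levinson_rc(r):
    """Levinson-Durbin recursion on ACF lags r = [r_0, ..., r_P].

    Returns (ks, err): reflection coefficients [k_1, ..., k_P] and the
    final prediction error power. Sign convention (assumed): k_1 = -r_1 / r_0.
    """
    order = len(r) - 1
    a = [1.0] + [0.0] * order
    err = r[0]
    ks = []
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err
        ks.append(k)
        prev = a[:]
        for i in range(1, m):
            a[i] = prev[i] + k * prev[m - i]
        a[m] = k
        err *= (1.0 - k * k)
    return ks, err

def log_jacobian_acf_to_rc(r0, ks):
    """log |J_T| for z = [r_0, k_1, ..., k_P] as a function of r = [r_0, ..., r_P]."""
    p = len(ks)
    return (-p * math.log(r0)
            - sum((p - i) * math.log(1.0 - ks[i - 1] ** 2) for i in range(1, p)))
```

For an AR(1)-shaped ACF such as `[1.0, 0.5, 0.25]`, the recursion yields a first coefficient of magnitude 0.5 and a second coefficient of zero, as expected for a first-order process.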

Example 6 - Log Bilinear Transformation: Let

$$z_i = \log\left(\frac{1 - k_i}{1 + k_i}\right), \quad 1 \le i \le P.$$


We implement the classical Neyman-Pearson classifier but with the class PDFs factored using the PDF projection theorem

$$j^* = \arg\max_j \; \frac{p_x(x|H_{0,j}(z_j))}{p_z(z_j|H_{0,j}(z_j))}\; p_z(z_j|H_j) \quad \text{at } z_j = T_j(x) \tag{22}$$

where we have allowed for class-dependent, data-dependent, reference hypotheses.

The chain-rule processor (14) is ideally suited to classifier modularization. Fig. 2 is a block diagram of a class-specific classifier. The packaging of the feature calculation together with the J-function calculation is called the class-specific module. Each arm of the classifier is composed of a series of modules called a “chain.”
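The structure of (22) can be illustrated with a toy two-class classifier. This is a sketch under simplifying assumptions, not the paper's implementation: a single fixed reference $H_0$ (unit-variance WGN) is used for both arms, the class feature PDFs are written analytically instead of being trained, and all class, module, and function names are hypothetical.

```python
import math

def log_px_h0(x):
    """Reference H0: iid zero-mean, unit-variance Gaussian noise."""
    n = len(x)
    return -0.5 * n * math.log(2.0 * math.pi) - 0.5 * sum(v * v for v in x)

def log_normal(u, mean, var):
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (u - mean) ** 2 / var

def log_scaled_chi2(z, n, var):
    """Log-PDF of z = mean(x_i^2) when x is iid N(0, var)."""
    u = n * z / var
    return (math.log(n / var) - 0.5 * n * math.log(2.0) - math.lgamma(0.5 * n)
            + (0.5 * n - 1.0) * math.log(u) - 0.5 * u)

class MeanModule:
    """Feature z = sample mean; under H0, z is N(0, 1/N)."""
    def feature(self, x):
        return sum(x) / len(x)
    def log_j(self, x):
        return log_px_h0(x) - log_normal(self.feature(x), 0.0, 1.0 / len(x))

class PowerModule:
    """Feature z = mean square; under H0, N*z is Chi-square(N)."""
    def feature(self, x):
        return sum(v * v for v in x) / len(x)
    def log_j(self, x):
        return log_px_h0(x) - log_scaled_chi2(self.feature(x), len(x), 1.0)

def classify(x, arms):
    """Each arm is (module, log feature PDF); pick the largest projected log-likelihood."""
    scores = [module.log_j(x) + log_feature_pdf(module.feature(x), len(x))
              for module, log_feature_pdf in arms]
    best = max(range(len(scores)), key=lambda j: scores[j])
    return best, scores

ARMS = [
    (MeanModule(), lambda z, n: log_normal(z, 1.0, 1.0 / n)),   # class 0: mean 1, variance 1
    (PowerModule(), lambda z, n: log_scaled_chi2(z, n, 4.0)),   # class 1: mean 0, variance 4
]
```

Because each feature is sufficient for its own class against $H_0$ in this toy setup, each arm's score equals the true raw-data log-likelihood of its class, even though the two arms use different features.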


B.  Feature  Selectivity:  Classifying  Without  Training 


For the log-bilinear transformation of Example 6, we have

$$|J_T| = \prod_{i=1}^{P} \frac{2}{1 - k_i^2}.$$

Additionally, taking the log of the first feature ($r_0$) results in a further improvement.
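The log-bilinear conditioning step and its J-function can be sketched as below. The code assumes the element-wise map $z_i = \log((1-k_i)/(1+k_i))$, whose derivative has magnitude $2/(1-k_i^2)$; the function names are illustrative.

```python
import math

def log_bilinear(ks):
    """Map reflection coefficients in (-1, 1) to unbounded features."""
    return [math.log((1.0 - k) / (1.0 + k)) for k in ks]

def inverse_log_bilinear(zs):
    """Inverse map: k = (1 - e^z) / (1 + e^z) = -tanh(z / 2)."""
    return [-math.tanh(0.5 * z) for z in zs]

def log_j_log_bilinear(ks):
    """log |J_T| = sum_i log(2 / (1 - k_i^2)) for the element-wise map."""
    return sum(math.log(2.0 / (1.0 - k * k)) for k in ks)
```

Since the map is 1:1, the log J-function is simply added to the log J-functions of the preceding modules in the chain.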


IV.  Application  of  Theorem  1  to  Classification 
A.  Classifier  Architecture 

Application of the PDF projection theorem to classification is simply a matter of substituting (9) into (1). The result is the classifier (22).


The J-function and the feature PDF provide a factorization of the raw data PDF into trained and untrained components. The ability of the J-function to provide a “peak” at the “correct” feature set gives the classifier a measure of classification performance without needing to train. In fact, it is not uncommon that the J-function dominates, eliminating the need to train at all. This we call the feature selectivity effect. For a fixed amount of raw data, as the dimension of the feature set decreases, indicating a larger rate of data compression, the effect of the J-function compared with the effect of the feature PDF increases. An example where the J-function dominates is a bank of matched filters for known signals in noise. If we regard the matched filters as feature extractors and the matched filter outputs as scalar features, it may be shown that this method is identical to comparing only the J-functions. Let $z_j = |w_j' x|^2$, where $w_j$ is a normalized signal template such that $w_j' w_j = 1$. Then, under the white (independent) Gaussian noise (WGN) assumption, $z_j$ is distributed $\chi^2(1)$. It is straightforward to show that the J-function is a monotonically increasing function of $z_j$. Signal waveforms can be reliably classified using only the J-function and ignoring the PDF of $z_j$ under each hypothesis. The curse of dimensionality can be avoided if the dimension of $z_j$ is small for each j. This possibility exists, even in complex problems, because $z_j$ is required only to have information sufficient to separate class $H_j$ from a specially chosen reference hypothesis $H_{0,j}$.
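The monotonicity claim for the matched-filter example can be made explicit. This sketch assumes the WGN reference hypothesis $H_0$ has unit variance, so that $z_j$ follows the standard $\chi^2(1)$ density under $H_0$, and it treats x as fixed while the template (and hence $z_j$) varies.

```latex
\begin{align*}
\log J(x, T_j, H_0)
  &= \log p_x(x|H_0) - \log p_z(z_j|H_0) \\
  &= \log p_x(x|H_0) + \frac{1}{2}\log(2\pi z_j) + \frac{z_j}{2}, \\
\frac{\partial \log J}{\partial z_j}
  &= \frac{1}{2 z_j} + \frac{1}{2} > 0 \qquad (z_j > 0).
\end{align*}
```

Since $\log p_x(x|H_0)$ is the same for every template, ranking the classes by their J-functions is equivalent to ranking them by the matched filter outputs $z_j$.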

C.  J -Function  Verification 

One thing to keep in mind is that it is of utmost importance that the J-function is accurate because this will ensure that the resulting projected PDF is, in fact, a valid PDF. For example, if the J-function is accidentally scaled by a large positive constant, the classifier will produce false classifications in favor of the class with the erroneous J-function. In contrast, it is not a serious problem if one of the likelihood functions $p(x|H_j)$ is not a perfect match to the data for class $H_j$ because it will be discovered by trial and error. A better PDF estimate can be found simply by comparing the likelihood values for the given class. Therefore, in the following examples, we are not very strict about the sufficiency of the features for the corresponding target class, although their approximate sufficiency is intuitively apparent. The ultimate justification for using a particular feature set can be the maximization of the likelihood values calculated on the raw data space using

$$\hat{p}(x|H_j) = J(x, T_j, H_0)\, p(z_j|H_j). \tag{23}$$


We  can  compare  competing  feature  sets  based  on  the  likelihood 
values  and  can  gradually  increase  the  likelihood  on  the  target 
class  by  experimenting  with  different  features  and  PDF  models. 

To verify the J-function, we have developed an end-to-end test that we call the “Acid Test” because of its foolproof nature. To use the method, it is first necessary to define a fixed hypothesis, which is denoted by $H_s$, for which we can compute the PDF $p(x|H_s)$ readily and for which we can synthesize raw data. Note that $H_s$ is not a reference hypothesis. The synthetic data is converted into features, and the PDF $p(z|H_s)$ is estimated from the synthetic features (using a Gaussian Mixture PDF, HMM, or any appropriate statistical model). Next, the theoretical PDF $p(x|H_s)$ is compared with the projected PDF

$$\hat{p}(x|H_s) = J(x, T, H_0)\, p(z|H_s)$$

for each sample of synthetic data. The log-PDF values are plotted on each axis, and the results should fall on the X = Y line. For each example, we will provide acid test results. Since the acid test checks the equality of two entirely different paths, it should find any systematic error in PDF estimation or in the J-function calculation.
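The acid test procedure can be sketched for a simple case. The following is an illustration under stated assumptions rather than the authors' test harness: it uses the scalar power feature with a fixed unit-variance WGN reference, takes $H_s$ to be WGN of variance 100, and replaces the Gaussian mixture with a single Gaussian fitted to the synthetic features. The function names and default sizes are hypothetical.

```python
import math
import random

def power_feature(x):
    """Feature z = (1/N) * sum(x_i^2)."""
    return sum(v * v for v in x) / len(x)

def log_j_power(x):
    """log J for the power feature with the fixed reference H0 (unit-variance WGN)."""
    n = len(x)
    z = power_feature(x)
    log_px = -0.5 * n * math.log(2.0 * math.pi) - 0.5 * n * z
    log_pz = (math.log(n) - math.lgamma(0.5 * n) - 0.5 * n * math.log(2.0)
              + (0.5 * n - 1.0) * math.log(n * z) - 0.5 * n * z)
    return log_px - log_pz

def log_px_hs(x, var):
    """Theoretical log-PDF of x under Hs: iid zero-mean Gaussian noise of variance var."""
    n = len(x)
    return -0.5 * n * math.log(2.0 * math.pi * var) - 0.5 * sum(v * v for v in x) / var

def acid_test(n=256, var=100.0, n_train=2000, n_test=200, seed=0):
    """Return (theoretical, projected) log-PDF pairs for synthetic data under Hs."""
    rng = random.Random(seed)
    sigma = math.sqrt(var)

    def synthesize():
        return [rng.gauss(0.0, sigma) for _ in range(n)]

    # Estimate p(z|Hs) from synthetic features (single Gaussian as a stand-in).
    train = [power_feature(synthesize()) for _ in range(n_train)]
    mean = sum(train) / n_train
    spread = sum((t - mean) ** 2 for t in train) / (n_train - 1)

    pairs = []
    for _ in range(n_test):
        x = synthesize()
        z = power_feature(x)
        log_pz_hs = -0.5 * math.log(2.0 * math.pi * spread) - 0.5 * (z - mean) ** 2 / spread
        pairs.append((log_px_hs(x, var), log_j_power(x) + log_pz_hs))
    return pairs
```

Plotting the second element of each pair against the first should give points near the X = Y line. Any residual scatter in this sketch comes from the single-Gaussian feature model, not from the J-function, which is exact for this feature.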


V.  Example:  Linear  Functions  of  Exponential, 
Chi-Square,  or  Log-Exponential  RVs 

A widely used combination of transformations in signal processing is to first apply an orthogonal linear transformation, perform a squaring operation (or magnitude-squared for complex RVs), and then perform a linear transformation. These transformations include widely used features such as MEL cepstrum [17], polynomial fits to power series and power spectra, autocorrelation functions and, through one-to-one transformations, autoregressive (AR) and reflection coefficients (RC).

The general form is the following. Let x be an N-by-1 real or complex vector. Let $u = U^H x$ be some real or complex orthogonal linear transformation such that $U^H U = vI$. Note that U does not need to be square if x is real and u is complex since we omit any redundant elements of u. Let n be the length of u. For the case of DFT of a real vector, n = N/2 + 1. Next, let y be the vector whose elements are the magnitude squared (if complex) or squared (if real) values of the elements of u

$$y_i = |u_i|^2, \quad 0 \le i \le n - 1.$$

Finally, let

$$z = A'y \tag{24}$$

where A is a real n-by-M matrix.
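As one concrete instance of (24), the circular ACF features treated later in this section can be obtained from DFT magnitude-squared bins using the matrix A of (33) and (34). The sketch below uses only the standard library and assumes N is even; the function names are illustrative.

```python
import cmath
import math

def dft_mag_squared(x):
    """y_k = |sum_t x_t exp(-j 2 pi k t / N)|^2 for k = 0, ..., N/2."""
    n = len(x)
    y = []
    for k in range(n // 2 + 1):
        u = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        y.append(abs(u) ** 2)
    return y

def acf_matrix(n, p):
    """(N/2 + 1)-by-(P + 1) matrix A: weight 1 for bins 0 and N/2, weight 2 otherwise."""
    rows = []
    for i in range(n // 2 + 1):
        w = 1.0 if i in (0, n // 2) else 2.0
        rows.append([w * math.cos(2.0 * math.pi * i * j / n) / n ** 2
                     for j in range(p + 1)])
    return rows

def apply_transpose(a, y):
    """z = A'y."""
    return [sum(a[i][j] * y[i] for i in range(len(y))) for j in range(len(a[0]))]

def circular_acf(x, p):
    """Direct circular ACF: z_i = (1/N) * sum_t x_t x_[t+i], indices modulo N."""
    n = len(x)
    return [sum(x[t] * x[(t + i) % n] for t in range(n)) / n for i in range(p + 1)]
```

The two routes, `apply_transpose(acf_matrix(N, P), dft_mag_squared(x))` and `circular_acf(x, P)`, should agree to rounding error for real x, which is what allows the linear stage to be analyzed separately from the squaring stage.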

A.  Two  Approaches  to  Computing  the  J -Function 

For  the  features  in  (24),  there  is  no  closed-form  solution  to 
the  J-function,  except  in  some  simple  cases  [15].  There  are, 
however,  two  very  good  approximations  discussed  in  the  next 
sections.  The  second  method  (central  limit  theorem)  will  be  used 
in  the  subsequent  example. 

1) Saddlepoint Approximation Method: The saddlepoint approximation (SPA) was discussed in a previous publication [15]; therefore, we will only give an overview here. In the referenced paper, the case of autocorrelation coefficients computed from a real vector x was discussed. The reference hypothesis used for this approach, which is denoted by $H_0$, is white (independent) Gaussian noise of zero mean and variance 1. The numerator PDF of the J-function

$$J(x, T, H_0) = \frac{p_x(x|H_0)}{p_z(z|H_0)} \tag{25}$$

is known exactly, and the denominator PDF is approximated by the SPA. In extreme cases, this approach can potentially suffer from the “tail PDF problem.”

To appreciate the tail PDF problem, imagine scaling a given sample x by a large positive number K. When calculating $J(Kx, T, H_0)$, we will quickly reach a point where the J-function is a ratio of two numbers that are essentially zero and cannot be reliably computed. In practice, we find that if all calculations are made in the log-domain, the log-J function is well-behaved for very large input values. There are limits, however, and we find that the SPA, which is a recursive search for the saddlepoint itself, will eventually have convergence problems. To alleviate this problem, we use a variable reference hypothesis (see Section II-B). Let v be a rough estimate of the variance of x. Let $H_v$ be the hypothesis that the input variance equals v. Assuming $H_v$ and $H_0$ are in the ROS of z, (25) is theoretically independent of $H_v$, and thus

$$\frac{p_x(x|H_0)}{p_z(z|H_0)} = \frac{p_x(x|H_v)}{p_z(z|H_v)}.$$

However

$$p_z(z|H_v) = v^{-M}\, p_z(v^{-1}z|H_0)$$

where M is the dimension of z. Therefore

$$J(x, T, H_0) = v^{M}\, \frac{p_x(x|H_v)}{p_z(z/v|H_0)} \tag{26}$$

which provides a convenient way to normalize z prior to calculating the SPA.

2) CLT Method: The second method that gives us a workable solution is the CLT. We use the chain rule to separately analyze the two stages: a) orthogonal transformation and squaring and b) linear transformation. We will design a two-module chain for a subset of the autocorrelation function (ACF) estimates. The processing chain necessary to compute the ACF coefficients can be broken down into two stages:

1) Compute y, which are the magnitude-squared FFT bins.

2) Compute z, which is a subset of the elements of IFFT(y), which is the real part of the inverse FFT of y.


1) Features and Region of Sufficiency: Let y be the length N/2 + 1 vector of magnitude-squared bins of the DFT of x

$$y = [y_0, y_1 \ldots y_{N/2}]'$$

where

$$y_k = \left|\sum_{i=1}^{N} x_i e^{-j2\pi k i/N}\right|^2, \quad 0 \le k \le \frac{N}{2}.$$

The ROS of y is quite broad, encompassing all Gaussian processes with a power spectrum.

2) Reference Hypothesis: For our reference hypothesis for this stage, we use $H_0$, which is the standard normal density (WGN hypothesis with unit variance).

3) Input PDF: We have

$$p_x(x|H_0) = (2\pi)^{-N/2} \exp\left\{-\frac{1}{2}\sum_{i=1}^{N} x_i^2\right\}. \tag{27}$$

4) Output PDF: Note that under $H_0$, y is a set of independent RVs. It is easily shown that $y_0$, $y_{N/2}$ obey the $\chi^2(1)$ density with mean N and variance $2N^2$. In addition, $y_1 \ldots y_{N/2-1}$ obey the $\chi^2(2)$ or exponential density with mean N and variance $N^2$. Thus

$$p_y(y|H_0) = \prod_{i=0}^{N/2} p_y(y_i|H_0) \tag{28}$$

where the densities of the individual bins are given in (29) and (30).


B.  Structure  of  the  Examples 

As explained previously, a class-specific classifier can be organized into “modules.” Each module consists of a feature transformation and a J-function calculation. The J-function requires the definition of a reference hypothesis and the calculation of the numerator (input) and denominator (output) PDF. Accordingly, we organize this example and those that follow into modules. For each module, we explain the following.

1) Features and ROS. We describe the feature transformation $z = T(x)$ and the ROS for the features (see Section II-B). Ideally, the ROS, which is denoted by $\mathcal{H}_z$, includes the “target class” $H_i$ for which this feature set is designed.

2) Reference Hypothesis. We define the reference hypothesis $H_0$ used in the J-function. Often, this hypothesis is a data-dependent reference, which is written $H_0(z)$.

3) Input PDF. We define this as the numerator of the J-function.

4) Output PDF. This is the denominator of the J-function.

5) Test Results. When appropriate, we present results of the “acid test” (Section IV-C).
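The module organization listed above can be sketched for the DFT magnitude-squared stage. This is a minimal illustration with hypothetical function names. It assumes N is even, a unit-variance WGN reference $H_0$, end bins (k = 0, N/2) that follow a scaled $\chi^2(1)$ density, and interior bins that follow an exponential density, all with mean N under $H_0$.

```python
import cmath
import math

def dft_mag_squared(x):
    """Feature: y_k = |DFT_k(x)|^2 for k = 0, ..., N/2."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) ** 2
            for k in range(n // 2 + 1)]

def log_px_h0(x):
    """Input PDF (numerator): unit-variance WGN."""
    n = len(x)
    return -0.5 * n * math.log(2.0 * math.pi) - 0.5 * sum(v * v for v in x)

def log_py(y, mean):
    """Output PDF (denominator): independent bins with common mean `mean`."""
    last = len(y) - 1
    total = 0.0
    for k, yk in enumerate(y):
        if k in (0, last):   # scaled Chi-square(1) for the end bins
            total += -0.5 * math.log(2.0 * math.pi * mean * yk) - 0.5 * yk / mean
        else:                # exponential for the interior bins
            total += -math.log(mean) - yk / mean
    return total

def log_j_stage1(x):
    """log J = log p_x(x|H0) - log p_y(y|H0), with bin mean N under H0."""
    return log_px_h0(x) - log_py(dft_mag_squared(x), float(len(x)))
```

A simple test in the spirit of item 5: for WGN of variance s, the bins keep the same distributional form with mean N s, so `log_j_stage1(x) + log_py(y, N * s)` should reproduce the theoretical raw-data log-PDF of x.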


C. Stage 1: DFT Magnitude-Squared

Stage 1 of the two-stage CLT approach is discussed here. The densities of the individual bins in (28) are

$$p_y(y_i|H_0) = \frac{1}{v\sqrt{2\pi}} \left(\frac{y_i}{v}\right)^{-1/2} \exp\left\{-\frac{y_i}{2v}\right\}, \quad i = 0, \frac{N}{2} \tag{29}$$

and

$$p_y(y_i|H_0) = \frac{1}{v} \exp\left\{-\frac{y_i}{v}\right\}, \quad 1 \le i \le \frac{N}{2} - 1 \tag{30}$$

and v is the mean of the elements of y (v = N).

D. Stage 2: Linear Transformation

Stage 2 of the two-stage CLT approach is discussed here. In stage 2, we apply a linear transformation to y. We use ACF as an example, but the basic method applies to any linear transformation.

1) Features and Region of Sufficiency: We let z be the first P + 1 circular ACF samples

$$z_i = \frac{1}{N} \sum_{n=1}^{N} x_n x_{[n+i]}, \quad 0 \le i \le P \tag{31}$$

where [n + i] is taken modulo-N. We use the circular ACF estimates in this example for simplicity because they may be written in terms of y, but the J-function may be found for any variety of ACF estimate. The features (31) may be written in terms of y as follows:


$$z_i = \frac{1}{N^2} \sum_{k=0}^{N-1} y_k \cos\left(\frac{2\pi k i}{N}\right), \quad 0 \le i \le P \tag{32}$$

with $y_{N-k} = y_k$ for real x. This has a compact matrix notation

$$z = A'y$$

where A is the (N/2 + 1)-by-(P + 1) matrix defined by

$$A_{ij} = \frac{2}{N^2} \cos\left(\frac{2\pi i j}{N}\right), \quad 0 \le j \le P, \quad 1 \le i \le \frac{N}{2} - 1 \tag{33}$$

$$A_{ij} = \frac{1}{N^2} \cos\left(\frac{2\pi i j}{N}\right), \quad 0 \le j \le P, \quad i = 0, \frac{N}{2}. \tag{34}$$

Since z is the ACF estimates of order P, the approximate ROS is all AR processes of order P and less.

2) Reference Hypothesis: Because we intend to use the CLT to approximate the J-function denominator, we need to use a variable reference hypothesis $H_0(z)$ such that the mean of z under $H_0(z)$ is equal to or close to z itself. There are two possible methods. For arbitrary matrices A, this can be done by projecting the input vector upon the column space of A. Let $H_0(z)$ be the hypothesis that y has mean

$$\bar{y}_z = A(A'A)^{-1}z. \tag{35}$$

Notice that $A'\bar{y}_z = z$, that is, under $H_0(z)$, the mean of z is z itself.

One possible problem that can occur is if $\bar{y}_z$ in (35) happens to be negative, which is quite possible, but not allowed. A suitable solution is to use a constrained optimization, that is, choose $\bar{y}_z$ such that it is positive, and $A'\bar{y}_z$ is as close as possible to z.

A more satisfying way to guarantee a positive $\bar{y}_z$ in the case of ACF is the following. We let $H_0(z)$ be the hypothesis that y obeys the AR spectrum corresponding to z. Thus, we must use the Levinson algorithm to solve for the Pth-order AR coefficients $\sigma^2$, a. If A(k) is the DFT of a (padded to length N), then

$$P_z(k) = \frac{N\sigma^2}{|A(k)|^2}$$

is the AR spectrum corresponding to a, scaled to the mean of the DFT magnitude-squared bins. We let

$$\bar{y}_z = [P_z(0) \ldots P_z(N/2)]'. \tag{36}$$

For large N, we have $A'\bar{y}_z \to z$.

3) Input PDF: We need to evaluate $p_y(y|H_0(z))$ and $p_z(z|H_0(z))$. We assume $\{y_i\}$ are a set of independent exponential and $\chi^2$ RVs with means corresponding to $\{\bar{y}_i\}$, which are the elements of $\bar{y}_z$. Specifically

$$p_y(y_i|H_0(z)) = \frac{1}{\bar{y}_i\sqrt{2\pi}} \left(\frac{y_i}{\bar{y}_i}\right)^{-1/2} \exp\left\{-\frac{y_i}{2\bar{y}_i}\right\}, \quad i = 0, \frac{N}{2} \tag{37}$$

and

$$p_y(y_i|H_0(z)) = \frac{1}{\bar{y}_i} \exp\left\{-\frac{y_i}{\bar{y}_i}\right\}, \quad 1 \le i \le \frac{N}{2} - 1. \tag{38}$$


4) Output PDF: Because $H_0(z)$ is “close” to z, we approximate $p_z(z|H_0(z))$ by the central limit theorem (CLT). Under $H_0(z)$, the elements of y are independent with mean $\bar{y}_z$ and diagonal covariance $\Sigma_y$, which are defined by

$$\Sigma_y(i,i) \triangleq E\left((y_i - \bar{y}_i)^2 \,|\, H_0(z)\right) = \begin{cases} 2(\bar{y}_i)^2, & i = 0, \frac{N}{2} \\ (\bar{y}_i)^2, & 1 \le i \le \frac{N}{2} - 1. \end{cases}$$

We can then easily compute the mean and covariance of z:

$$\bar{z}_z = E(z|H_0(z)) = A'\bar{y}_z$$

and

$$\Sigma_z = A'\Sigma_y A.$$

Then

$$\begin{aligned} \log p_z(z|H_0(z)) &= -\frac{P+1}{2}\log(2\pi) - \frac{1}{2}\log|\det(\Sigma_z)| - \frac{1}{2}(z - \bar{z}_z)'\Sigma_z^{-1}(z - \bar{z}_z) \\ &\simeq -\frac{P+1}{2}\log(2\pi) - \frac{1}{2}\log|\det(\Sigma_z)| \end{aligned} \tag{39}$$

where in the last step, we make the approximation $\bar{z}_z \simeq z$. This approximation becomes better as N becomes larger. Note also that the method just described is closely related to the ML approach. In fact, $\Sigma_z$ is related to the Fisher's information of the ACF estimates [16].

E.  Test  Results 

The acid test was run on the ACF features using both the SPA and CLT methods. A model order of 2 was used giving a feature dimension of 3 (lags 0 through 2). Results are shown in Figs. 3 and 4. A raw data size of N = 32 was used with a test hypothesis, $H_s$, of iid Gaussian noise of variance 100. There were 400 samples of synthetic data used for training the feature PDF using a Gaussian mixture. The results show that both methods “pass” the test because the estimates of projected PDF (vertical axis) appear to track the theoretical PDF values (horizontal axis). The errors are quite small, considering that these are PDF estimates of a 32-dimensional PDF. A comparison was made of the difference of the log J-function values output by the two methods, and it was found that the difference was less than 1.0 for all samples.

[Figure 3 panels, titled “Acid test for chain ar_2_32”; horizontal axes: theoretical log-PDF, -135 to -105.]

Fig. 3. Acid test results for autocorrelation function using SPA method. Top frame shows the estimate of the feature log-PDF projected to the raw data plotted against the theoretical log-PDF. The bottom frame shows the difference plotted against the theoretical log-PDF. A Gaussian mixture was used to estimate the feature PDF.

[Figure 4 panels, titled “Acid test for chain ar_2_32”; horizontal axes: theoretical log-PDF, -135 to -105.]


Fig. 4. Acid test results for autocorrelation function using CLT method. Top frame shows the estimate of the feature log-PDF projected to the raw data plotted against the theoretical log-PDF. The bottom frame shows the difference plotted against the theoretical log-PDF. A Gaussian mixture was used to estimate the feature PDF.

VI. Example: Cepstrum and MEL Cepstrum

An important set of features in speech analysis is cepstrum [18] and MEL cepstrum [17]. For the cepstrum, the SPA for the denominator PDF of the J-function for fixed WGN reference hypothesis is described in [15], so we will not need to discuss it further. The MEL cepstrum, however, is a member of the set of transformations in Section V. The MEL cepstrum is computed as follows. Let y be a DFT magnitude-squared vector of length N/2 + 1, where N is the DFT size. The MEL filter bank is a matrix A of spectral template vectors

$$A = [c_1\, c_2 \ldots c_M]$$

where M = 12 is a commonly used value. The MEL cepstrum equals

$$z = \mathrm{DCT}(\log(A'y))$$

where the log function operates on each element of its argument, and DCT is the discrete cosine transform. Note that DCT and log are both 1:1 transformations, whose J-functions are the determinant of the respective Jacobian matrices. Our primary concern, then, is to analyze the intermediate feature set

$$w = A'y$$

which we have previously described in Section V.

An important warning is that the usual MEL filter bank does not include filters centered at the 0-th and N/2 (Nyquist) DFT bins. These two filters need to be included in any class-specific classifier; otherwise, z will not be sufficient for simple scaling operations. The proper way to eliminate the features is not to exclude them from the MEL filterbank, but rather to assign a noninformative (such as uniform) PDF to them at the output.

VII. Feature Selection

One question we have not yet covered is how one determines an appropriate feature set for a data class. Choosing features is rarely done through statistical or mathematical analysis. The choice of features remains an art requiring intuition. This intuition is helped by the methods of resynthesis and model order/segment size selection discussed below.

A. Sufficiency by Resynthesis

In many problems, the ability of a human to classify an event exceeds the ability of the machine. Human performance is almost always a lofty goal. It is therefore reasonable to choose features that can represent the data with enough fidelity to resynthesize the event to the satisfaction of a human observer. For example, the resynthesis of speech data from features has been used for speech analysis to determine the appropriateness of speech analysis methods [19]. We recommend this method whenever it is appropriate.


B.  Determination  of  Segmentation  and  Model  Order 

Once a feature set is chosen, it may be possible to fine tune it. This is particularly true if the feature extraction is governed by a set of parameters such as segment size and model order. Full implementation of (7a) may be computationally prohibitive unless a simplified PDF model is used. The method now presented may be a way to automatically determine these parameters.

In many statistical models, there are two parts to the modeling: measurement PDF and spatio-temporal distribution. For example, in an HMM, the state PDFs are measurement PDFs and the state transition matrix describes the spatio-temporal component of the model. By removing the spatio-temporal part of the model, a simplified model results (just a measurement PDF). It may be possible to optimize the feature model order and segmentation based only on the simplified model. The optimized features, it is conjectured, would achieve the highest likelihood once the spatio-temporal parts of the model were restored. We have conducted many experiments that support this conjecture.

The particulars of the method are now presented. Let the feature data be written $Z^k = \{z_1^k, z_2^k \ldots z_{N_k}^k\}$, where k is a particular choice of segment size and/or model order and $N_k$ is the total number of observation vectors corresponding to choice k. Note that we have collected all the available data from all events into one mass, forgetting the temporal or spatial organization, and forgetting which event the observations are from. We also assume that $z^k$ are low enough in dimension that a parametric PDF estimator (e.g., Gaussian mixture) can be estimated from the data. Let the data be divided into a training set $(X_{tr}, Z^k_{tr})$ and testing set $(X_{te}, Z^k_{te})$ for cross-validation. Next, we estimate the PDF

$$p_k(z^k)$$

using $Z^k_{tr}$ for model choice k. The feature PDF is projected to the input data space where it can be compared across different values of k. We have

$$L(k) = \log J(X, Z^k) + \sum_{n} \log p_k(z_n^k)$$

where $\log J(X, Z^k)$ is the aggregate log-J-function for the data set. Next, L(k) is calculated for $(X_{te}, Z^k_{te})$. For added accuracy, L(k) can also be computed by swapping $Z^k_{tr}$ and $Z^k_{te}$ and averaging. The optimal choice of k is that which maximizes L(k).

This approach is robust against overparameterization because as the model order (and dimension of $z^k$) increases above the optimal value, the ability to estimate the PDF worsens and the average of the cross-validated likelihood will begin to fall.
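The selection rule can be sketched as a small scoring routine. This is an outline under simplifying assumptions, not the procedure used for the experiments: features are scalar, the feature PDF is a single fitted Gaussian standing in for the Gaussian mixture, and the per-observation log J-function values are assumed to have been computed already by the feature modules. All names are hypothetical.

```python
import math

def fit_gaussian(train_z):
    """Fit a scalar Gaussian to training features (stand-in for a Gaussian mixture)."""
    m = sum(train_z) / len(train_z)
    v = sum((t - m) ** 2 for t in train_z) / len(train_z)
    return m, v

def log_gaussian(z, m, v):
    return -0.5 * math.log(2.0 * math.pi * v) - 0.5 * (z - m) ** 2 / v

def score(train_z, test_z, test_log_j):
    """L(k) on the test set: aggregate log J plus summed log feature PDF."""
    m, v = fit_gaussian(train_z)
    return sum(test_log_j) + sum(log_gaussian(z, m, v) for z in test_z)

def cross_validated_score(z_a, log_j_a, z_b, log_j_b):
    """Average of the two scores obtained by swapping training and testing halves."""
    return 0.5 * (score(z_a, z_b, log_j_b) + score(z_b, z_a, log_j_a))

def select_best(candidates):
    """candidates maps each choice k to (z_a, log_j_a, z_b, log_j_b)."""
    return max(candidates, key=lambda k: cross_validated_score(*candidates[k]))
```

Because every candidate is scored on the raw-data domain, choices with different segment sizes or model orders can be compared directly, even though their feature spaces differ.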

C.  Example 

To test the approach, we first created a synthetic signal class approximating a “bang” sound. Independent Gaussian noise is passed through a second-order autoregressive filter. The filter output is modulated by an envelope function with an instantaneous attack and an exponential decay. The attack time is chosen at random. Independent noise is added to the result. An example of a typical synthetic event is shown in Fig. 5. A total of 100 events were created, each with a total length of 4096 samples. The features were extracted by segmenting the events into segments of length N, where N ranged from 32 to 512 in powers of 2. Autocorrelation features of order P were extracted from each segment where P was between 2 and 7. The results are shown in Table I and show a peak at P = 4, N = 128, which is about a 10-ms segment size. This is in agreement with intuition because the width of the event envelope near the peak is about 10 ms.

Fig. 5. Example of a typical synthetic event. Time-series (top) and spectrogram (bottom). Sample rate was 12 500 Hz.

TABLE I
Results of Model Order/Segment Size Selection Experiment. Results Are Log-Likelihoods Relative to Maximum

    P    N=32        N=64       N=128      N=256      N=512
    2    -7782.6     -5105.3    -5027.9    -8336.7    -16748.1
    3    -6590.3     -2033.2    -921.5     -2820.3    -8650.7
    4    -7491.6     -1716.7    0.0        -1550.6    -6672.8
    5    -9316.0     -2127.8    -119.7     -1499.8    -6546.5
    6    -11691.5    -2731.7    -262.8     -1540.3    -6480.7
    7    -16159.1    -5437.2    -2169.5    -3164.1    -8080.1

VIII. Versatile General-Purpose Class-Specific Time-Series Classifier Using Reflection Coefficients and HMM

It is possible to use the material thus far discussed to arrive at a fully modular, extremely versatile class-specific classifier. A functional block-diagram of this classifier is provided in Fig. 6.

A  given  time-series  is  processed  by  each  class-model  to  ar¬ 
rive  at  a  raw-data  log-likelihood  for  the  class.  Each  block  la¬ 
beled  “RC(P)”  computes  the  reflection  coefficients  of  order  P 
from  the  associated  time-series  segment.  The  figure  shows  two 
class-models  employing  different  segmentation  lengths  as  well 
as  different  model  orders.  The  log-correction  terms  (log  ./-func¬ 
tions)  of  all  the  segments  are  added  together  and  the  aggregate 
correction  term  is  added  to  the  HMM  log-likelihood  (from  the 
forward  procedure  [20])  to  arrive  at  the  final  raw  data  log-like¬ 
lihood  for  the  class. 

Each  “RC(P)”  block  is  composed  of  a  series  of  modules  im¬ 
plementing  ACF  calculation  followed  by  conversion  to  RCs  and 
ending  with  feature  conditioning  by  the  log-bilinear  transforma¬ 
tion.  This  may  be  implemented  four  modules  corresponding  to 
Sections  V-C,  V-D,  and  III-D  (Examples  5  and  6).  Alternatively, 
the  SPA  approach  (see  Section  V-Al)  may  be  used  in  place  of 
the  first  two  modules  and  will  produce  virtually  identical  fea¬ 
tures  and  J-function  values.  This  classifier  has  the  added  benefit 
that the models may be validated by re-synthesis of time-series
from features (either computed from actual data or generated at
random by the HMM). Using the method of Section VII-B, the
segmentation sizes and model orders may be optimized for each
class individually, eliminating the need to “compromise,” and,
because it is a class-specific classifier, features of any kind
may be used. Adding new class processors will not affect the
existing class processors or their training.

Fig. 6. Block diagram of an HMM and RC-based class-specific classifier. A given time-series is processed by each class-model to arrive at a raw-data
log-likelihood for the class. Each block labeled “RC(P)” computes the Pth-order reflection coefficients from the corresponding time-series segment and is
implemented by a series of modules.
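
The feature path inside an “RC(P)” block (ACF, then RCs, then log-bilinear conditioning) can be sketched as follows. This is only an illustration under stated assumptions: the biased ACF estimator, the Levinson-Durbin sign convention, and the form log((1 - k)/(1 + k)) used for the log-bilinear transformation are choices made for this example and may differ from the definitions in Sections V-C, V-D, and III-D. The function names are hypothetical, and the J-function of each module is not computed here.

```python
import numpy as np

def acf(x, P):
    """Biased autocorrelation estimates r[0..P] of one segment (assumed estimator)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return np.array([np.dot(x[: n - m], x[m:]) / n for m in range(P + 1)])

def reflection_coefficients(r, P):
    """Levinson-Durbin recursion: ACF lags r[0..P] -> P reflection coefficients.

    Uses the convention A(z) = 1 + a[1] z^-1 + ... + a[P] z^-P, so that k[m-1]
    equals a[m] of the order-m predictor. Other sign conventions exist.
    Returns (k, err) where err is the final prediction error variance.
    """
    r = np.asarray(r, dtype=float)
    a = np.zeros(P + 1)
    a[0] = 1.0
    err = r[0]
    k = np.zeros(P)
    for m in range(1, P + 1):
        # correlation between the current prediction error and lag m
        acc = r[m] + np.dot(a[1:m], r[m - 1:0:-1])
        k_m = -acc / err
        a_prev = a.copy()
        for i in range(1, m):
            a[i] = a_prev[i] + k_m * a_prev[m - i]
        a[m] = k_m
        err *= 1.0 - k_m ** 2
        k[m - 1] = k_m
    return k, err

def log_bilinear(k):
    """Map RCs from (-1, 1) to the real line (assumed form of the transformation)."""
    k = np.asarray(k, dtype=float)
    return np.log((1.0 - k) / (1.0 + k))

def rc_features(segment, P):
    """Full feature path of one hypothetical RC(P) block for one segment."""
    k, _ = reflection_coefficients(acf(segment, P), P)
    return log_bilinear(k)
```

Mapping the RCs onto the whole real line removes the hard limits at ±1, which tends to make the features easier to model with Gaussian-type PDFs in the HMM.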

IX.  Conclusions 

We have introduced a powerful new theorem that opens up a
wide range of new statistical methods for signal processing,
parameter estimation, and hypothesis testing. Instead of needing a
common feature space for likelihood comparisons, the theorem
allows likelihood comparisons to be made on a common raw
data space, while the difficult problem of PDF estimation can
be accomplished in separate feature spaces. We have discussed
the recursive application of the theorem, which gives a
hierarchical breakdown and allows processing streams to be analyzed
in stages. Whereas previous publications on the method have
relied on a common fixed reference hypothesis, this paper has
presented the use of class-dependent and data-dependent
reference hypotheses and has explored the relationship to asymptotic
maximum likelihood theory. The use of a data-dependent
reference hypothesis allows two new methods of analyzing the
feature sets: maximum likelihood (ML) and central limit theorem
(CLT). These extensions significantly broaden the applicability
of the method. We have illustrated the use of the approach using
common feature types including autoregressive and MEL
cepstrum features. We have also presented a method of combined
feature/model order selection that is enabled by the
class-specific approach. Finally, we have provided an example of a
versatile class-specific classifier using autoregressive features.